Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language

Authors

Zahra Abdolhosseini, Department of Computer Engineering, Alzahra University, Tehran, Iran

Mohammad Reza Keyvanpour, Department of Computer Engineering, Alzahra University, Tehran, Iran

Abstract

Researchers in Persian natural language processing (NLP) have limited access to linguistic tools suitable for text processing, so research in Persian text processing remains very limited. Since a dataset is an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. The corpora provided in this article are based on the Hamshahri dataset, which is appropriate only for simple information retrieval and simple natural language processing because it has not been tagged. We converted this dataset into a tagged collection and improved its text quality. The new corpora minimize the need for text preprocessing. We used the STeP-1 tools for text processing and proposed some ideas for removing the bugs in these tools in order to increase their quality. Finally, we used the new corpora for text retrieval, and the results showed improved performance.

Similar resources

Applying Natural Language Processing Techniques for Effective Persian- English Cross-Language Information Retrieval

Much attention has recently been paid to natural language processing in information storage and retrieval. This paper describes how the application of natural language processing (NLP) techniques can enhance cross-language information retrieval (CLIR). Using a semi-experimental technique, we took Farsi queries to retrieve relevant documents in English. For translating Persian queries, we used a...

Arabic Natural Language Processing for Information Retrieval

Human Language Technology has played a big role in implementing Latin-based information retrieval systems. Two of the most cited techniques are stemming and truncation. Numerous studies have shown that the inflectional structure of words has a big impact on the retrieval accuracy of information retrieval systems (IRS) for Latin-based languages. Stemming or truncation is done for two principal reas...

Natural Language Processing in Information Retrieval

Many Natural Language Processing (NLP) techniques have been used in Information Retrieval. The results are not encouraging. Simple methods (stopwording, Porter-style stemming, etc.) usually yield significant improvements, while higher-level processing (chunking, parsing, word sense disambiguation, etc.) yields only very small improvements or even a decrease in accuracy. At the same time, higher-...

Information Retrieval and Trainable Natural Language Processing

Existing work on indexing and retrieving documents from large on-line collections has had great success at treating both documents and queries as simple, unstructured collections of individual words (terms) with dependencies among these terms largely ignored. However, natural language text has a great deal of structure. In particular, at a scale close to that of the individual word, there are i...

Information Retrieval Using Robust Natural Language Processing

We developed a fully automated Information Retrieval System which uses advanced natural language processing techniques to enhance the effectiveness of traditional keyword-based document retrieval. In early experiments with the standard CACM-3204 collection of abstracts, the augmented system has displayed capabilities that made it clearly superior to the purely statistical base system...

Journal title:

International Journal of Information Science and Management

Volume 13, Issue 2, Pages 0-0
